Abstract



Previous research has demonstrated that the framework of dictionary learning with sparse coding,
in which signals are decomposed as linear combinations of a few atoms of a learned dictionary, is well
suited to reconstruction problems. This framework has also been used for discrimination tasks such as image
classification. To achieve better classification performance, several methods have been developed to learn
a discriminative dictionary in a supervised manner. However, when the data become extremely large in scale,
these methods are no longer effective because they are all batch-oriented approaches.
For this reason, we propose a novel online algorithm for discriminative dictionary learning, dubbed ODDL
in this paper. First, we introduce a linear classifier into the conventional dictionary learning formulation
and derive a discriminative dictionary learning problem. Then, we use an online algorithm to solve the
derived problem. Unlike most existing approaches, which update the dictionary and classifier alternately
by iteratively solving sub-problems, our approach updates them jointly. It can also substantially
shorten the training runtime and is particularly suitable for large-scale classification problems. To
evaluate the performance of the proposed ODDL approach in image recognition, we conduct experiments
on three well-known benchmarks, and the experimental results show that ODDL is promising for
image classification tasks.



1 Introduction 

Dictionary learning with sparse coding, which decomposes signals as linear combinations of a few atoms from
some basis or dictionary, has drawn extensive attention in recent years. Researchers have demonstrated that
this framework can achieve state-of-the-art performance in image processing tasks such as image denoising
and face recognition [22, 23]. Given a signal $x \in \mathbb{R}^n$ and a fixed dictionary
$D \in \mathbb{R}^{n \times k}$ which contains $k$ atoms, we say that $x$ admits a sparse representation
over $D$ if we can find a sparse coefficient vector $\alpha \in \mathbb{R}^k$ such that $x \approx D\alpha$.
Predefined dictionaries, based on various types of wavelets [19], are not suitable for many vision
applications such as appearance-based image classification, because their atoms do not make use of the
semantic prior of the given signals. Learned dictionaries, in contrast, can achieve better performance
than predefined ones in various image processing tasks.



Several algorithms have recently been proposed for learning such dictionaries based on sparse representation.
For example, the K-SVD algorithm [1] learns an overcomplete dictionary from the training data. It updates
the atoms in the dictionary one at a time, keeping all the other atoms fixed and finding a new atom,
together with its corresponding coefficients, that minimizes the mean square error (MSE).
Researchers have shown that this algorithm can achieve outstanding performance in image compression and
denoising [10]. However, the K-SVD algorithm focuses only on the reconstructive power of the learned
dictionary, so it is not intrinsically adapted to (image) discrimination or classification tasks. To address
this problem while retaining the power of dictionary learning, several methods have been proposed recently.
For example, semi-supervised dictionaries [22] are learned by iteratively updating the K-SVD dictionary
based on the results of a linear classifier. Similarly, by adding a linear classifier, another algorithm
called discriminative K-SVD [27] has been developed for image classification. Moreover, to make the
dictionary discriminative, a more sophisticated loss function, the logistic loss (the softmax function for
multiclass classification), has been added to the classical dictionary learning formulation.

In addition, most recent methods for dictionary learning are iterative batch algorithms, which access all the
training samples at each iteration to minimize the objective function under sparsity constraints. Therefore,
when the training set becomes very large, these methods are no longer efficient. To overcome this
bottleneck, an online algorithm for dictionary learning based on the block-coordinate descent method [15]
has been proposed in the literature. However, this online method still learns a reconstructive dictionary,
which represents the signals well but is not adapted to classification. Mairal et al. attempt to address
this issue with task-driven dictionary learning [13], where supervised dictionaries are learned via a
stochastic gradient descent algorithm.

To overcome the above two problems, i.e. the lack of discriminative power in the reconstructive dictionary
and the inefficiency caused by a large-scale training set, we propose a novel formulation for learning
discriminative dictionaries in an online manner. We name our approach ODDL in this paper. In our work,
we first incorporate label information into the dictionary learning stage by adopting a linear classifier, and
then formulate a supervised dictionary learning problem. To solve this problem, we propose a corresponding
online algorithm, in which we apply the block-coordinate descent method to train the dictionary and
classifier simultaneously. Unlike most recent methods, which update the dictionary and classifier alternately
by iteratively solving sub-problems, our method learns the dictionary and classifier jointly.
Finally, we carry out experiments on three well-known benchmarks to demonstrate the effectiveness of
our proposed method, and the experimental results show that the proposed ODDL method is competitive
for classification tasks.

In summary, the main contributions of this paper include the following: 

• We propose a novel online algorithm, with a numerical solution, to learn a discriminative dictionary.
It merges online learning and discriminative dictionary learning into one framework. In
other words, our proposed approach can efficiently and effectively derive the discriminative dictionary
while also handling large-scale classification problems.

• Our analysis shows that the algorithm updates the classifier simultaneously with the
dictionary when a new training sample arrives. In this way, the computational cost can be significantly
reduced.

• As shown experimentally, our approach achieves encouraging performance compared with some other
dictionary learning approaches.

• We suggest a novel, efficient and effective dictionary construction scheme for face
recognition. Our experiments indicate that this scheme is promising for face recognition.

The paper is organized as follows. Section 2 introduces the basic formulation of dictionary learning and 
sparse representation for classification. Then our proposed approach is presented in Section 3, followed by 
the experimental results demonstrated in Section 4. Finally, we conclude our paper in Section 5. 

Figure 1: Flows of three different dictionary learning schemes. From top to bottom, the schematic
illustrations of the dictionary learning methods are reconstructive (a), discriminative (b) and online (c).

2 Related Work on Dictionary Learning Methods 

Recent research has demonstrated that natural signals such as images admit sparse representations
over some redundant basis¹ (also called a dictionary). This explains why image
classification can be performed by sparse representation with an overcomplete dictionary learned from the
training images. In this section we briefly review three dictionary learning schemes which are closely
related to our proposed method. Fig. 1 illustrates the flows of the three dictionary learning schemes
together with the classifier training process.



2.1 Reconstructive Dictionary Learning for Classification 

In classical sparse coding problems, consider a signal $x \in \mathbb{R}^n$ and a dictionary
$D = [d_1, \ldots, d_k] \in \mathbb{R}^{n \times k}$. Under the assumption that a natural signal can be
approximately represented by a linear combination of a few selected atoms from the dictionary, $x$ can be
represented by $D\alpha$ for some sparse coefficient vector $\alpha \in \mathbb{R}^k$. Finding the sparse
representation of $x$ amounts to solving the following optimization problem:

$$\min_{\alpha} \|x - D\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_p \le L, \qquad (1)$$

where $p$ is 0 or 1. Sparse coding with the $\ell_0$ pseudo norm is an NP-hard problem [2], and several greedy
algorithms have been proposed to approximate the solution. The $\ell_1$ formulation of sparse coding is the
well-known Lasso [25] or Basis Pursuit [6] problem and can be solved effectively by algorithms such as
LARS [8].

¹ Here the term "basis" is loosely used, since the dictionary can be overcomplete and, even in the case of
just complete, there is no guarantee of independence between the atoms.

In the classical reconstructive dictionary learning problem, overlapping patches rather than
whole images are sparsely decomposed, because natural images are usually very large. For an image
$I$, suppose there are $M$ overlapping patches $\{x_i\}_{i=1}^{M} \subset \mathbb{R}^n$. Then the dictionary
$D \in \mathbb{R}^{n \times k}$ is learned by alternately solving the following optimization over $D$ and $A$:

$$\{D, A\} = \arg\min_{D \in \mathbb{R}^{n \times k},\, A} \sum_{i=1}^{M} \|x_i - D\alpha_i\|_2^2 \quad \text{s.t.} \quad \|\alpha_i\|_p \le L \ \text{ for } i = 1, \ldots, M, \qquad (2)$$

where $A = [\alpha_1, \ldots, \alpha_M] \in \mathbb{R}^{k \times M}$ is the coefficient matrix, $x_i$ is the
$i$-th patch of image $I$ written as a column vector, and $\alpha_i$ is the corresponding sparse code. Several
algorithms have been proposed to solve this dictionary learning problem, such as K-SVD [1].

Given $C$ sets of signals $P_i$, $i = 1, 2, \ldots, C$, which belong to $C$ different classes, the training
stage for classification based on sparse representations is composed of two independent parts: dictionary
learning and classifier learning. First, a dictionary $D$ for the $C$ classes is learned according to (2).
Then, the classifier is trained by solving the following optimization problem:

$$\min_{W} f(Y, W, A) + \lambda \|W\|_F^2, \qquad (3)$$

where $Y$ is the label matrix of the training patches, $A$ is the coefficient matrix computed on the learned
dictionary $D$, and $f$ is a loss function. However, this dictionary learning scheme has two main drawbacks,
which can be seen in Fig. 1(a):

1. The dictionary training and classifier training are two independent stages. Thus, the learned dictionary
cannot capture the most discriminative cues that are helpful for classification.

2. In practice, a large number of training samples is often used to improve the representational capacity
of the dictionary. Because this scheme is batch-oriented, however, it becomes inefficient on large-scale
datasets and may fail to learn an effective dictionary.

2.2 Discriminative Dictionary Learning for Classification 

Researchers have already made efforts to overcome the first drawback mentioned in the previous subsection,
namely that the learned dictionaries lack discriminative power for classification. In previous work, a
discriminative term is introduced to combine the classifier learning process with dictionary learning, and
the final objective function is:

$$\min_{D, W, A} \sum_{i=1}^{M} \left\{ \|x_i - D\alpha_i\|_2^2 + \mathcal{C}\big(y_i f(\alpha_i, W)\big) \right\} + \lambda_1 \|W\|_F^2 \quad \text{s.t.} \quad \|\alpha_i\|_p \le L \ \text{ for } i = 1, \ldots, M, \qquad (4)$$

where $W$ is the classifier parameter, $y_i \in \{1, -1\}$ is the label of patch $x_i$, and $\mathcal{C}$ is
the logistic loss function, $\mathcal{C}(x) = \log(1 + e^{-x})$. In addition, in [22] and [27], a simpler
term, a linear classifier, is considered for the discriminative power:

$$\min_{D, W, b, A} \sum_{i=1}^{M} \left\{ \|x_i - D\alpha_i\|_2^2 + \|y_i - W\alpha_i - b\|_2^2 \right\} + \lambda_1 \|W\|_F^2 \quad \text{s.t.} \quad \|\alpha_i\|_p \le L \ \text{ for } i = 1, \ldots, M, \qquad (5)$$

where $W$ and $b$ are the classifier parameters, and $y_i$ is the label vector of patch $x_i$, in which the
element associated with the class label is 1 and the others are 0. $\|\cdot\|_F$ denotes the Frobenius norm
of a matrix $X$, i.e. $\|X\|_F = \big(\sum_i \sum_j X_{ij}^2\big)^{1/2}$. Without loss of generality, the
intercept $b$ can be omitted by normalizing all the signals.

Dictionaries learned by these methods generally perform better in classification tasks than those learned in a
reconstructive way. However, Fig. 1(b) shows a serious drawback of these methods: if a new
and important training sample arrives after the dictionary has been learned, the dictionary must be relearned
from scratch. In other words, discriminative dictionary learning methods also suffer from the large-scale
dataset problem.

2.3 Online Dictionary Learning for Classification 

Learning from a large-scale training set is a natural analogue of how human beings learn from experience.
However, the two aforementioned dictionary learning schemes fail to handle large-scale datasets. For this
reason, an online dictionary learning algorithm [15] has emerged as an efficient dictionary learning paradigm
for large-scale training sets. Inspired by Bottou et al., Mairal et al. replace the original
empirical objective function with the expected objective function, obtaining a new dictionary learning problem:

$$\min_{D} \mathbb{E}_{x}\left[ \|x - D\alpha^*\|_2^2 \right], \qquad (6)$$

where $\alpha^*$ denotes the sparse coefficients computed in the sparse coding stage. To solve the above
problem, they propose an online algorithm which applies the block-coordinate descent method for dictionary
updating. However, one obvious drawback of this algorithm is that it ignores the valuable label information,
which could enhance classification performance. Furthermore, the training flow in Fig. 1(c) reveals
another critical defect: even though the dictionary can be efficiently learned in an
online manner, the classifier must be relearned from scratch when a new training sample arrives.



3 Online Discriminative Dictionary Learning 

In the previous section, we reviewed three dictionary learning schemes with their respective drawbacks. Now
we derive our online discriminative dictionary learning (ODDL) method to overcome these defects. The
schematic flow chart is shown in Fig. 2, from which the difference from the
aforementioned three schemes is apparent.



3.1 Proposed Formulation 

To address the lack of discriminative information in the learned dictionary, we introduce a discriminative
term into the original dictionary learning problem. In this paper, we consider the linear classifier for its
simplicity. Adding the linear classifier, we obtain the following problem:

$$\min_{D, W, A} \|X - DA\|_F^2 + \lambda_0 \|Y - WA\|_F^2 + \lambda_1 \|W\|_F^2, \qquad (7)$$

where $X$ is the patch matrix, $Y$ is the label matrix of the training patches, $\|X - DA\|_F^2$ is the
reconstructive error term, $\|Y - WA\|_F^2$ is the discriminative term, and $\lambda_0$ controls the
trade-off between the reconstructive and discriminative terms.

Figure 2: The flow of our proposed dictionary learning algorithm.

We now address the second issue, the large-scale dataset problem. As Bottou et al. point out, the
quantity of real interest is not the minimum of the empirical cost but the minimum of the expected
cost:

$$\min_{D, W} \mathbb{E}_{x, y}\left[ \|x - D\alpha^*\|_2^2 + \lambda_0 \|y - W\alpha^*\|_2^2 \right] + \lambda_1 \|W\|_F^2, \qquad (8)$$

where the expectation is taken with respect to the joint distribution of $(x, y)$. In practice, a large
amount of training data is needed to improve the representative power of learned dictionaries. For example,
when applying dictionaries to image processing tasks, the number of training patches can be up to several
millions in a single image. In this case, we must use an efficient technique to handle the large-scale
dataset, and online learning is such a technique.



3.2 Optimization 



In this subsection, we briefly introduce an online discriminative dictionary learning algorithm to solve the
formulation (8) proposed in the previous subsection. As in most existing dictionary learning algorithms,
there are two stages in our proposed algorithm.

Sparse coding. The sparse coding problem (1) with a learned dictionary $D$ is an $\ell_p$ norm optimization
problem, where $p$ is 0 or 1. Several algorithms have been proposed for solving this problem. In this paper,
we choose the $\ell_0$ pseudo norm formulation as our sparse coding problem, since it lets us
explicitly control the sparsity (the number of nonzero elements) of the coefficients of the signals projected
on the learned dictionary. This leads us to use the Orthogonal Matching Pursuit (OMP) algorithm, a greedy
algorithm which sequentially selects the atom with the highest correlation to the current residual and then
orthogonally projects the signal onto the selected atoms.
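
As an illustration of the greedy selection just described, the following is a minimal NumPy sketch of OMP. It is not the authors' implementation: the function name `omp`, the tolerance, and the assumption that the columns of `D` have unit norm are ours.

```python
import numpy as np

def omp(D, x, L, tol=1e-12):
    """Greedy l0 sparse coding of x over D with at most L nonzero coefficients.

    Assumes (our assumption) that the columns of D are normalized.
    """
    alpha = np.zeros(D.shape[1])
    support = []                      # indices of the atoms selected so far
    coef = np.zeros(0)
    residual = x.astype(float)
    for _ in range(L):
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = -1.0   # never select the same atom twice
        j = int(np.argmax(correlations))
        if correlations[j] <= tol:         # residual is already (numerically) zero
            break
        support.append(j)
        # Orthogonal projection: least-squares fit of x on the selected atoms.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    if support:
        alpha[support] = coef
    return alpha

# Toy usage with an identity dictionary, where the answer is easy to check by hand.
D_toy = np.eye(4)
x_toy = np.array([0.0, 3.0, 0.0, -2.0])
alpha_toy = omp(D_toy, x_toy, L=2)
```

With the identity dictionary the two largest entries of the signal are picked in order of magnitude, so `alpha_toy` reproduces `x_toy` exactly.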

Dictionary and classifier updating. This stage is markedly different from that of other discriminative
dictionary learning approaches. In our proposed ODDL, we use the block-coordinate descent method to update
the dictionary and classifier jointly, whereas the usual strategy in other algorithms is to find
approximate solutions for the dictionary and classifier by iteratively solving sub-problems.



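Stacking the two error terms of Eq. (7) in its expected-cost form (8), we derive a compact formulation as our
objective function:

$$\min_{D, W} \mathbb{E}_{x, y}\left[ \left\| \begin{pmatrix} x \\ \sqrt{\lambda_0}\, y \end{pmatrix} - \begin{pmatrix} D \\ \sqrt{\lambda_0}\, W \end{pmatrix} \alpha^* \right\|_2^2 \right] + \lambda_1 \|W\|_F^2. \qquad (9)$$

Note that from a dictionary learning viewpoint, the "dictionary" $\begin{pmatrix} D \\ \sqrt{\lambda_0}\, W \end{pmatrix}$,
which represents the "signal" $\begin{pmatrix} x \\ \sqrt{\lambda_0}\, y \end{pmatrix}$, is always assumed to
be normalized column-wise in the updating process, i.e. the Euclidean length of each column of the
"dictionary" is 1. Moreover, the real dictionary $D$ we derive is also normalized. Therefore, we
can drop the regularization term $\lambda_1 \|W\|_F^2$ from the objective function and obtain the final
objective:

$$\min_{D, W} \mathbb{E}_{x, y}\left[ \left\| \begin{pmatrix} x \\ \sqrt{\lambda_0}\, y \end{pmatrix} - \begin{pmatrix} D \\ \sqrt{\lambda_0}\, W \end{pmatrix} \alpha^* \right\|_2^2 \right]. \qquad (10)$$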
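
The stacking step can be checked numerically. The sketch below, with hypothetical sizes and a hypothetical value of $\lambda_0$, compares the two-term cost of a single sample with the single stacked norm; the random data are placeholders, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, k = 6, 3, 5            # hypothetical signal, label and dictionary sizes
lam0 = 0.7                   # hypothetical trade-off weight lambda_0

x, y = rng.normal(size=n), rng.normal(size=q)
D, W = rng.normal(size=(n, k)), rng.normal(size=(q, k))
alpha = rng.normal(size=k)

# Reconstruction error plus lambda_0 times the classification error.
two_terms = np.sum((x - D @ alpha) ** 2) + lam0 * np.sum((y - W @ alpha) ** 2)

# The same cost written as one squared norm over stacked quantities.
x_stacked = np.concatenate([x, np.sqrt(lam0) * y])
D_stacked = np.vstack([D, np.sqrt(lam0) * W])
one_term = np.sum((x_stacked - D_stacked @ alpha) ** 2)
```

The two values agree up to floating-point error, which is why the joint update of $D$ and $W$ can reuse a standard dictionary update on the stacked matrix.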



Our algorithm relies on the important assumption that the training set is composed of i.i.d. samples
$(x, y)$ drawn from a probability distribution $p(x, y)$. Following the same strategy as stochastic gradient
descent, our algorithm draws one sample $(x_t, y_t)$ at each iteration, computes the sparse code $\alpha_t$
of $x_t$ on the previous dictionary $D_{t-1}$, and then updates the dictionary $D_t$ and the classifier
parameter $W_t$ simultaneously by solving the following problem:

$$\min_{D, W} \sum_{i=1}^{t} \left\| \begin{pmatrix} x_i \\ \sqrt{\lambda_0}\, y_i \end{pmatrix} - \begin{pmatrix} D \\ \sqrt{\lambda_0}\, W \end{pmatrix} \alpha_i \right\|_2^2. \qquad (11)$$



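To address this problem, with a slight abuse of notation we first denote the stacked vector
$\begin{pmatrix} x_i \\ \sqrt{\lambda_0}\, y_i \end{pmatrix}$ as $x_i$ and the stacked matrix
$\begin{pmatrix} D \\ \sqrt{\lambda_0}\, W \end{pmatrix}$ as $D$. Then problem (11) can be rewritten as

$$\min_{D} \operatorname{Tr}(D^T D M_t) - 2 \operatorname{Tr}(D^T N_t), \qquad (12)$$

where $M_t = \sum_{i=1}^{t} \alpha_i \alpha_i^T$ and $N_t = \sum_{i=1}^{t} x_i \alpha_i^T$.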
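
The role of $M_t$ and $N_t$ as running summaries of the past samples can be illustrated with a short NumPy sketch. Sizes and random data are hypothetical; dense random codes stand in for sparse ones because the identity does not depend on sparsity.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, t = 7, 4, 10                 # stacked dimension n + q, atoms, samples seen so far
X = rng.normal(size=(m, t))        # columns play the role of the stacked samples x_i
A = rng.normal(size=(k, t))        # columns play the role of the codes alpha_i
D = rng.normal(size=(m, k))        # any candidate stacked dictionary

# Accumulate the statistics one sample at a time, as an online algorithm would.
M = np.zeros((k, k))
N = np.zeros((m, k))
for i in range(t):
    M += np.outer(A[:, i], A[:, i])
    N += np.outer(X[:, i], A[:, i])

trace_form = np.trace(D.T @ D @ M) - 2.0 * np.trace(D.T @ N)
squared_error = np.sum((X - D @ A) ** 2)     # the objective of (11)
constant = np.sum(X ** 2)                    # Tr(X^T X), independent of D
```

Here `trace_form + constant` equals `squared_error`, so minimizing the trace form over $D$ is equivalent to minimizing the summed squared error while storing only two fixed-size matrices.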

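Using the block-coordinate descent method, the $j$-th column of $D$ can be updated using

$$d_j \leftarrow \frac{1}{m_{jj}} (n_j - D m_j) + d_j, \qquad (13)$$

where $m_j$ and $n_j$ denote the $j$-th columns of $M_t$ and $N_t$, and $m_{jj}$ is the $j$-th diagonal
entry of $M_t$. Then, separating the updated stacked column into its dictionary part $d_j$ and its classifier
part $w_j$, we update $d_j$ and $w_j$ by

$$w_j \leftarrow \frac{w_j}{\|d_j\|_2}, \qquad d_j \leftarrow \frac{d_j}{\|d_j\|_2}. \qquad (14)$$

The details of the derivation are given in the Appendix.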
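
A minimal sketch of one sweep of the column updates (13) and (14) on the stacked matrix is given below. The function name `update_columns`, the argument `n` (number of rows belonging to the dictionary part), and the guard for unused atoms are our assumptions, not part of the paper.

```python
import numpy as np

def update_columns(D_stacked, M, N, n, n_sweeps=1, eps=1e-12):
    """Block-coordinate sweeps over the columns of the stacked dictionary.

    D_stacked has shape (n + q, k): the first n rows hold d_j, the remaining
    rows hold the classifier part of column j.
    """
    D_new = D_stacked.copy()
    k = D_new.shape[1]
    for _ in range(n_sweeps):
        for j in range(k):
            if M[j, j] < eps:          # atom not used by any code so far: keep it
                continue
            # Update (13): exact minimizer of the objective with respect to column j.
            u = (N[:, j] - D_new @ M[:, j]) / M[j, j] + D_new[:, j]
            # Update (14): both parts are divided by the norm of the dictionary part.
            norm_d = max(np.linalg.norm(u[:n]), eps)
            D_new[:, j] = u / norm_d
    return D_new

# Toy usage on random statistics (placeholders, not data from the paper).
rng = np.random.default_rng(0)
n, q, k, t = 3, 2, 4, 30
X = rng.normal(size=(n + q, t))
A = rng.normal(size=(k, t))
D0 = rng.normal(size=(n + q, k))
D1 = update_columns(D0, A @ A.T, X @ A.T, n)
```

After the sweep, the first `n` rows of every updated column have unit Euclidean norm, matching the normalization assumed for the real dictionary.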



3.3 Algorithm 

The approach we propose in this paper is a block-coordinate descent algorithm, and the overall procedure
is summarized in Algorithm 1. In this algorithm, the i.i.d. samples $(x_t, y_t)$ are drawn sequentially from
an unknown probability distribution $p(x, y)$. Since the distribution $p(x, y)$ is unknown, obtaining
such i.i.d. samples may be very difficult. The common trick in online algorithms is to cycle over a randomly
permuted training set. The convergence of this type of algorithm has been established both
empirically and theoretically in [15]. We do not reproduce the proofs here, as they are not the main
contribution of this paper; interested readers are referred to [15], where the proofs are available.
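
The sampling trick mentioned above can be sketched as a small generator. The name `permuted_stream` and the fixed seed are hypothetical choices for illustration.

```python
import numpy as np

def permuted_stream(num_samples, T, seed=0):
    """Yield T sample indices by cycling over one random permutation of the training set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)
    for t in range(T):
        yield int(order[t % num_samples])

# Seven draws from a training set of five samples: one full pass, then the cycle restarts.
indices = list(permuted_stream(num_samples=5, T=7))
```

Choosing `T` equal to the number of training samples gives exactly one pass over the permuted set.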

Algorithm 1 The online discriminative algorithm for dictionary learning

Input: $(x, y) \in \mathbb{R}^n \times \mathbb{R}^q \sim p(x, y)$ (random variables and a method to draw
i.i.d. samples of $p$), $\lambda_0 \in \mathbb{R}$ (regularization parameter), $L \in \mathbb{R}$ (sparsity
factor), $T$ (number of iterations).

Output: Dictionary $D$ and classifier parameter $W$.

1: Initialize the dictionary $D$ and classifier $W$.
2: Set $M_0 \in \mathbb{R}^{k \times k} = 0$ and $N_0 \in \mathbb{R}^{(n+q) \times k} = 0$.
3: while the stopping criterion is not reached, for $t = 1$ to $T$, do
4:   Draw $(x_t, y_t)$ from $p(x, y)$.
5:   Sparse coding: compute $\alpha_t$ by solving the following optimization problem:
       $\alpha_t = \arg\min_{\alpha \in \mathbb{R}^k} \frac{1}{2} \|x_t - D_{t-1}\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \le L$
6:   $M_t = M_{t-1} + \alpha_t \alpha_t^T$.
7:   $N_t = N_{t-1} + \begin{pmatrix} x_t \\ \sqrt{\lambda_0}\, y_t \end{pmatrix} \alpha_t^T$.
8:   Update the parameters $D$ and $W$ by the block-coordinate descent method in Algorithm 2.
9:   Normalize the columns of $D$ such that the $\ell_2$ norm of each column equals 1.
10: end while
11: Return $D$ and $W$.

Algorithm 2 Dictionary and classifier parameter update

Input: stacked dictionary $D_{t-1} \in \mathbb{R}^{(n+q) \times k}$, $M_t \in \mathbb{R}^{k \times k}$,
$N_t \in \mathbb{R}^{(n+q) \times k}$.

Output: $D_t$ and $W_t$.

1: repeat
2:   for $l = 1$ to $k$ do
3:     Update the $l$-th column of $D_t$ using
         $d_l^{t} = \frac{1}{m_{ll}} (n_l - D^{t-1} m_l) + d_l^{t-1}$,
       where the superscript $t-1$ denotes the results from the $(t-1)$-th iteration.
4:     Separate $d_l^{t}$ into $d_l$ and $w_l$.
5:     Update $d_l$ and $w_l$ using
         $w_l = w_l / \|d_l\|_2, \qquad d_l = d_l / \|d_l\|_2$
6:   end for
7: until convergence
8: Return $D_t$ and $W_t$.

Initialization. The initialization of the dictionary $D$ and classifier $W$ plays an important role in our
proposed method, and poor initialization may lead to poor performance. One can use patches randomly
selected from the training data and a zero matrix to initialize $D$ and $W$ respectively. In practice, our
experiments show that using the classical reconstructive dictionary as the initial dictionary $D$ always
leads to better performance than using original patches from the training data. Given this initial
dictionary $D$, the classifier $W$ can be initialized by solving the optimization problem (3).
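
As a sketch of this initialization, assume that the loss $f$ in (3) is the squared error $\|Y - WA\|_F^2$, which matches the linear classifier used in this paper but is our assumption here. The problem then has the closed-form ridge solution $W = Y A^T (A A^T + \lambda I)^{-1}$. The helper names `one_hot` and `init_classifier` are hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Label matrix Y with one column per sample: 1 at the class index, 0 elsewhere."""
    Y = np.zeros((num_classes, len(labels)))
    Y[labels, np.arange(len(labels))] = 1.0
    return Y

def init_classifier(A, Y, lam):
    """Minimizer of ||Y - W A||_F^2 + lam * ||W||_F^2 for fixed sparse codes A (k x N)."""
    k = A.shape[0]
    # Solve (A A^T + lam I) W^T = A Y^T instead of forming an explicit inverse.
    return np.linalg.solve(A @ A.T + lam * np.eye(k), A @ Y.T).T

# Toy usage with random codes standing in for codes computed on the initial dictionary.
rng = np.random.default_rng(0)
num_atoms, num_samples, num_classes, lam = 4, 20, 3, 0.1
A = rng.normal(size=(num_atoms, num_samples))
labels = rng.integers(0, num_classes, size=num_samples)
Y = one_hot(labels, num_classes)
W0 = init_classifier(A, Y, lam)
```

At the returned `W0` the gradient $2(WA - Y)A^T + 2\lambda W$ vanishes, which is the optimality condition of the ridge problem.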

Mini-batch strategy. The convergence speed of our algorithm can be improved with a mini-batch strategy,
which is widely used in stochastic gradient descent algorithms. The mini-batch strategy draws more than one
sample (denote the number of samples as $\eta$) at each iteration instead of a single one. This is motivated
by the fact that the runtime for solving $\eta$ instances of the $\ell_0$ pseudo norm problem (1) with the
same dictionary $D$ can be greatly shortened using the Batch-OMP algorithm [24], which precomputes the
matrix $D^T D$.
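
To see why precomputing $D^T D$ helps, note that the atom correlations needed by OMP satisfy $D^T (x - D_S c) = D^T x - (D^T D)_{:,S}\, c$ for any support $S$ and coefficients $c$, so the residual never has to be formed explicitly. The sketch below checks this identity numerically; it illustrates the idea only and is not the Batch-OMP implementation of [24]. All sizes, the support, and the coefficients are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, batch = 8, 12, 5                 # hypothetical signal size, atoms, mini-batch size
D = rng.normal(size=(n, k))
D /= np.linalg.norm(D, axis=0)         # unit-norm atoms
X = rng.normal(size=(n, batch))        # one mini-batch of signals

G = D.T @ D                            # Gram matrix, computed once per dictionary
P = D.T @ X                            # initial correlations for the whole batch

support = [1, 4, 7]                    # an arbitrary support for the first signal
coef = rng.normal(size=len(support))   # arbitrary coefficients on that support

direct = D.T @ (X[:, 0] - D[:, support] @ coef)   # correlations via the residual
via_gram = P[:, 0] - G[:, support] @ coef         # same correlations via the Gram matrix
```

Both vectors coincide, and the Gram-based form only touches $k$-dimensional quantities, which is what makes coding many signals against one dictionary cheaper.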



4 Experiment 

In this section², we demonstrate the performance of our proposed ODDL method in two image classification
tasks, handwritten digit recognition and face recognition. Before presenting the experiments, we first discuss
the choices of three important parameters in our algorithm.

4.1 Choices of Parameters 

Parameter $L$. As introduced in the previous section, we choose the $\ell_0$ pseudo norm formulation
as our sparse coding problem and use the Orthogonal Matching Pursuit (OMP) algorithm
to find approximate solutions. The sparsity prior $L$ controls the number of nonzero elements of the sparse
coefficients in our algorithm. Our experiments have shown that handwritten digit images and face images can be
represented well when $L$ is 5 and 15 respectively.

Parameter $\lambda_0$. $\lambda_0$ is the parameter controlling the trade-off between the reconstructive and
discriminative power in our method. Since $\lambda_0$ weights the discriminative term in (7), large values
of $\lambda_0$ emphasize the classification error at the cost of representation ability, while small values
emphasize the reconstruction error at the cost of discriminative power. Thus, the value of
$\lambda_0$ plays an important role in balancing representation and classification. In practice, the value
$\lambda_0 = 1$ has given good performance in our experiments.

Parameter $T$. In our method, we cycle over a randomly permuted training set, which is a common technique
in online algorithms to obtain i.i.d. samples for experiments. We have observed that the experimental
results are consistently good when $T$ is chosen such that the whole training set is cycled through once.

4.2 Handwritten Digit Recognition 

In this section we present experiments on the MNIST [14] and USPS [17] handwritten digit datasets. MNIST
contains a total of 70000 images of size 28 x 28, of which 60000 images are for training and
10000 images are for testing. USPS contains 7291 training images and 2007 testing images of size 16 x 16.

All the digit images are vectorized and normalized to have zero mean and unit $\ell_2$ norm. Using these two
datasets, we test four methods: our proposed ODDL method; the ksvd method with a linear classifier, dubbed
ksvd-linear; the online reconstructive dictionary learning method with a linear classifier, dubbed
online-rec-linear; and the dksvd method [27]. In the ODDL and dksvd methods, we learn a single dictionary $D$
with 960 atoms, corresponding to roughly 96 atoms per class, and a single classifier. For the ksvd-linear
and online-rec-linear methods, 10 independent dictionaries, each with 96 atoms, are first learned, one for
each class. Then, we adopt the one-vs-all strategy [23] for learning classifiers. For class $i$, the
one-vs-all strategy uses all samples from class $i$ as the positive samples and samples from the other
classes as the negative samples to train the classifier of class $i$.
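
The preprocessing step described above can be sketched as follows. The function name and the small constant guarding against division by zero are our assumptions; the random array merely stands in for a stack of 16 x 16 digit images.

```python
import numpy as np

def vectorize_and_normalize(images, eps=1e-12):
    """Turn a stack of images (N x H x W) into columns with zero mean and unit l2 norm."""
    X = images.reshape(len(images), -1).T.astype(float)   # one column per image
    X -= X.mean(axis=0, keepdims=True)                    # zero mean per image
    X /= np.maximum(np.linalg.norm(X, axis=0, keepdims=True), eps)
    return X

# Placeholder data with the USPS image size.
images = np.random.default_rng(0).random((5, 16, 16))
X_digits = vectorize_and_normalize(images)
```

Each column of `X_digits` is then a 256-dimensional signal ready for sparse coding.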

The average error rates of the four tested methods on MNIST and USPS are shown in Table 1. The results
show that dictionaries learned in a discriminative way lead to better performance than those learned
in a reconstructive way when applied to classification tasks. Compared with methods that use
more sophisticated classifier models, such as linear and bilinear logistic loss functions, our proposed method
does not perform better. We believe that one of the main reasons is the simplicity of our linear classifier
model. Our proposed method provides a new strategy for online discriminative dictionary learning, and its
main strength is that the dictionary and classifier can be updated jointly, which is markedly different
from the dictionary and classifier training strategy in most existing methods. Figure 3 shows dictionaries
of the USPS dataset learned via the ksvd-linear and ODDL methods respectively.



² Our proposed ODDL method is an online approach, so testing on a large-scale database is required to fully
evaluate its performance. The large-scale evaluation is under way, and we plan to report it in future work.



Table 1: Average error rates of the tested methods for the MNIST and USPS datasets.

Method              MNIST   USPS
ODDL                3.58    5.35
ksvd-linear         5.07    7.12
online-rec-linear   5.32    -
dksvd               4.58    6.53

Figure 3: Above: the dictionary learned in a reconstructive manner. Below: the dictionary learned by our
ODDL method.

Table 2: Average runtime (s) for the training stage using our proposed method and the ksvd-linear method.

Method         MNIST   USPS
ODDL           156     23
ksvd-linear    583     62



Table 3: Average error rates of our proposed method for the MNIST and USPS datasets with different values
of the dictionary size k.

k        160    320    640    960    1280   2560
MNIST    5.49   4.76   4.02   3.58   3.92   4.38
USPS     7.63   6.43   5.78   5.35   5.69   6.24



In addition, we compare the runtime of our ODDL method and the ksvd-linear method for dictionary
and classifier training. We take the total time for learning the dictionaries and classifiers for all
classes, and then compute the average runtime by dividing it by the number of classes. The results are shown
in Table 2. Table 2 shows that our proposed ODDL method substantially shortens the runtime for dictionary
and classifier learning compared with the ksvd-linear method with the same dictionary size.

To study the role of the dictionary size in our method, we conduct another set of experiments. We learn
dictionaries from the training set with different sizes $k$ in {160, 320, 640, 960, 1280, 2560}, and record
the performance of these dictionaries on the testing set. The results are shown in Table 3. We observe that
the dictionary size plays an important role in the classification task. If $k$ is too small, the information
in the learned dictionaries is not sufficient for discrimination. When $k$ is too large, the learned
dictionaries contain too much redundant information, which may harm discrimination.

4.3 Extended YaleB Face Recognition 

The Extended YaleB face dataset [12] consists of 2414 near frontal face images of 38 individuals. These
images are taken with different poses and under different illumination conditions. We randomly divide the
dataset into two parts, and each part contains approximately 26 samples. One is used for learning the
dictionary and classifier, while the other is used as the testing set. Before presenting our experiments,
we describe the pre-processing steps. The most important features in face recognition are the eyebrows,
eyes, nose, mouth, and chin. Using this information, we divide each face image into four non-overlapping
patches from top to bottom, and into three non-overlapping patches from left to right. Figure 4 shows such
patches. We can observe that each patch contains at least one feature. After doing this, for each person we
have seven classes of patches. Then we vectorize all the patches and normalize them to have unit $\ell_2$
norm. In our experiments, seven dictionaries with 228 atoms and seven classifiers are learned, corresponding
to the seven patch classes.
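
The patch construction can be sketched with NumPy as below. The function name `face_patches`, the equal-sized split, and the 192 x 168 image size are assumptions made for illustration; the paper does not state the exact patch boundaries.

```python
import numpy as np

def face_patches(image, n_horizontal=4, n_vertical=3):
    """Return 4 top-to-bottom bands followed by 3 left-to-right strips of a face image."""
    bands = np.array_split(image, n_horizontal, axis=0)    # split the rows
    strips = np.array_split(image, n_vertical, axis=1)     # split the columns
    return bands + strips

image = np.zeros((192, 168))          # hypothetical cropped face image
patches = face_patches(image)

# Vectorize each patch and normalize it to unit l2 norm (guarding against all-zero patches).
vectors = [p.ravel() / max(np.linalg.norm(p), 1e-12) for p in patches]
```

This yields the seven patch classes per face: four horizontal bands and three vertical strips.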

For comparison, we test our proposed method against the ksvd-linear, online-linear, and dksvd methods.
The results are shown in Table 4. Discriminative dictionaries clearly perform better than
reconstructive dictionaries. Figure 5 plots the dictionaries learned by our ODDL method for two individuals.



Table 4: Average error rates of the tested methods for the Extended YaleB face dataset.

ODDL    ksvd-linear    online-linear    dksvd
1.09    2.03           2.24             1.76

Figure 4: Original patches drawn from face images.

Figure 5: Dictionaries learned via our proposed ODDL. Here we manually rearrange the seven learned
dictionaries into two big dictionaries.

Table 5: Average runtime (s) for the training stage using our proposed method and the ksvd-linear method.

ODDL (304)    ODDL (608)    ksvd-linear
3             4             10



As in the experiments with the handwritten digit datasets, we also compare the average training runtime
of our proposed ODDL method and the ksvd-linear method. Table 5 shows the results. For the
ksvd-linear method, the dictionary size is 6 for each patch class of each person. For our proposed method,
we measure the average training runtime when the dictionary sizes are 228 and 456 respectively. As expected,
learning smaller dictionaries shortens the runtime.



5 Conclusion and Future Work 



In this paper, we propose a novel framework for online discriminative dictionary learning (ODDL) for image
classification tasks. By introducing a linear classifier into the conventional dictionary learning problem,
the learned dictionary captures discriminative cues for classification along with the representational
power needed for reconstruction. We propose an online algorithm to solve this discriminative problem. Unlike
other algorithms, which find the dictionary and classifier alternately by iteratively solving sub-problems,
our algorithm finds them jointly. The experimental results on the MNIST and USPS handwritten digit datasets
and the Extended YaleB face dataset demonstrate that our method is competitive when applied to image
classification tasks. More experiments, in particular on large-scale training sets, need to be done to better
demonstrate the performance of our proposed method for image classification in the future.

A Appendix



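To obtain (12), denote $f(D, W)$ as the function to minimize in (11). With $x_i$ and $D$ denoting the stacked
quantities introduced before (12), a bit of algebra gives

$$\begin{aligned} f(D, W) &= \|X_t - D A_t\|_F^2 \\ &= \operatorname{Tr}\left[ (X_t - D A_t)^T (X_t - D A_t) \right] \\ &= \operatorname{Tr}(D^T D M_t) - 2 \operatorname{Tr}(D^T N_t) + \operatorname{Tr}(X_t^T X_t), \end{aligned} \qquad (15)$$

where $X_t = [x_1, \ldots, x_t]$ and $A_t = [\alpha_1, \ldots, \alpha_t]$. Since the last term of the final
expression does not depend on $D$ and $W$, dropping it yields (12).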

In order to obtain the update of $d_j$, the $j$-th column of $D$, a block-coordinate descent method is used.
Denote the objective function in (12) as $f(D)$. Then, using some algebraic transformations, we obtain

$$f(D) = \sum_{i} d_i^T \sum_{l} d_l m_{li} - 2 \sum_{i} d_i^T n_i. \qquad (16)$$
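
Now consider only the terms associated with $d_j$, which we denote as $f(d_j)$:

$$\begin{aligned} f(d_j) &= \Big( d_j^T \sum_{l \ne j} d_l m_{lj} + d_j^T d_j m_{jj} + \sum_{i \ne j} d_i^T d_j m_{ji} \Big) - 2 d_j^T n_j \\ &= \Big( d_j^T d_j m_{jj} + d_j^T \sum_{l \ne j} d_l m_{lj} + \sum_{i \ne j} d_i^T d_j m_{ji} \Big) - 2 d_j^T n_j \\ &= \Big( d_j^T d_j m_{jj} + d_j^T \sum_{l \ne j} d_l m_{lj} + d_j^T \sum_{l \ne j} d_l m_{lj} \Big) - 2 d_j^T n_j \\ &= d_j^T d_j m_{jj} + 2 d_j^T \sum_{l \ne j} d_l m_{lj} - 2 d_j^T n_j. \end{aligned} \qquad (17)$$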

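Note that the above transformations use the fact that the matrix $M_t$ is symmetric. Computing
the derivative of $f(d_j)$ with respect to $d_j$, we have

$$\frac{\partial f(d_j)}{\partial d_j} = 2 m_{jj} d_j + 2 \sum_{l \ne j} d_l m_{lj} - 2 n_j. \qquad (18)$$

Thus, setting the above derivative to 0, $d_j$ can be updated by

$$d_j = \frac{1}{m_{jj}} \Big( n_j - \sum_{l} d_l m_{lj} + d_j m_{jj} \Big) = \frac{1}{m_{jj}} (n_j - D m_j) + d_j, \qquad (19)$$

where the $d_j$ on the right-hand side is the value of the column before the update.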
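
The derivation can be sanity-checked numerically: after applying the update (19) to one column, the derivative (18) should vanish and the objective (12) should not increase. The following sketch uses random placeholders for the statistics and an arbitrary column index.

```python
import numpy as np

rng = np.random.default_rng(4)
m, k, t = 6, 5, 40                     # hypothetical stacked dimension, atoms, samples
X = rng.normal(size=(m, t))
A = rng.normal(size=(k, t))
M, N = A @ A.T, X @ A.T                # M is symmetric by construction
D = rng.normal(size=(m, k))

def objective(D_any):
    """Objective of (12): Tr(D^T D M) - 2 Tr(D^T N)."""
    return np.trace(D_any.T @ D_any @ M) - 2.0 * np.trace(D_any.T @ N)

j = 2                                  # arbitrary column to update
d_new = (N[:, j] - D @ M[:, j]) / M[j, j] + D[:, j]      # update (19)
D_new = D.copy()
D_new[:, j] = d_new

# Derivative (18) at the updated column; the sum over l != j excludes column j itself.
others = D_new @ M[:, j] - d_new * M[j, j]
gradient = 2.0 * M[j, j] * d_new + 2.0 * others - 2.0 * N[:, j]
```

Here `gradient` is zero up to rounding and `objective(D_new)` is no larger than `objective(D)`, as expected for an exact coordinate-wise minimizer.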